About this document:

This is an R Markdown Notebook. When opening it in R, it becomes interactive, and you can execute all of the code within this notebook You can then reproduce the results of this document. When viewed in a browser, you can explore the output (e.g., scroll through tables).

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter. Important: many chunks depend on each other - i.e., you can only run later chunks if you have ran the chunks above. So when you get an error message, try to run previous chunks first.

Learning objectives:

Mandatory for everybody

Optional for those who wish to enhance their R skills

Prerequisites:

Setup

Let’s first load the packages so that R knows how to “handle” MySQL, etc.

library(RMySQL) # loads the MySQL package
library(data.table) # loads the data.table package
library(dplyr) # loads the dplyr library
library(tidyr) # loads the tidyr library

Let us now also load some data from the classes’ MySQL server into R. Recall that the database holds a copy of daily data on the Top 200 Charts, as shown on https://spotifycharts.com.

database = dbConnect(MySQL(), user = 'student', password='rsm!2020',
                 dbname='rsm', host = 'sql.tilburg-digital.com')

dbListTables(database) # shows a list with tables in the database
## [1] "artists"  "echonest" "plays"    "tracks"

Our database consists of the four tables listed above. Now, let us explore these tables a bit. Writing the R code to run a particular query will take you a couple of lines - every time you wish to run code! That’s why we first write (from scratch) a couple of functions that will ease data retrieval later on. Don’t worry too much about understanding these functions at this point - just load them.

# Let's define a function to load some sample data from the SQL database server (the first X rows)
get_sample_data = function(tablename) {
  rs = suppressWarnings(dbSendQuery(database, paste0("SELECT * FROM ", tablename, " LIMIT 1000")))
  data = fetch(rs, n = -1)
}

# Let's define a function that will ease running queries on our MySQL database.
run_query = function(query, n = -1) {
  rs = suppressWarnings(dbSendQuery(database, query)) # runs the query
  data = fetch(rs, n = n) # fetches n rows of the result (here: -1 indicates to fetch ALL rows)
  return(data) # returns the result back to the main program
}

Explore the data

We can now use these functions to run some simple queries on our data, e.g., to take a look at three of the tables. By default, the get_sample_data function loads 1000 rows from the SQL database - we do this to save some bandwith.

The artist table

artists <- get_sample_data('artists')
artists

The tracks table

tracks<-get_sample_data('tracks')
tracks

The plays table

plays<-get_sample_data('plays')
plays

Let us also store a copy of the data in actual variables in R, so that we can use them with dplyr and data.table.

# To use the data with data.table, let us convert them to "data.table".
dt_artists = data.table(artists)
dt_tracks = data.table(tracks)
dt_plays = data.table(plays)

# To use the data with dplyr, let's convert them to a so-called "tibble":
df_artists= tbl_df(artists)
df_tracks = tbl_df(tracks)
df_plays = tbl_df(plays)

1. Filtering/querying/subsetting

The first set of operations we’re going to review are “filters”, which sometimes are also referred to as querying (“searching”), or subsetting your data (“showing only a part of what is there”).

There are two types of filters: filtering for particular rows, and filtering for particular columns.

Rows

Let’s start querying for some observations in rows. For example, we could “filter” for all artists that have the name Adele.

SQL

run_query('SELECT * FROM artists WHERE artists.name = "Adele"')

dplyr

To use the dplyr syntax, we have to run our commands on the df_ tables saved above.

filter(df_artists, name == 'Adele')

data.table

To use the data.table syntax, we have to run our commands on the dt_ tables savbed above.

dt_artists[name == 'Adele']

Columns

Let’s now proceed with “filtering” for particular columns. For example, we can only show the “name” column from the artist table. In our SQL queries, we now use the LIMIT option to only retrieve a few rows at a time (which saves us some bandwith).

SQL

run_query('SELECT name FROM artists LIMIT 100')

dplyr

select(df_artists, name) # selects everything

data.table

dt_artists[, c('name')] # selects everything

1. Filtering/querying/subsetting

The first set of operations we’re going to review are “filters”, which sometimes are also referred to as querying (“searching”), or subsetting your data (“showing only a part of what is there”).

There are two types of filters: filtering for particular rows, and filtering for particular columns.

Rows

Let’s start querying for some observations in rows. For example, we could “filter” for all artists that have the name Adele.

SQL

run_query('SELECT * FROM artists WHERE artists.name = "Adele"')

dplyr

To use the dplyr syntax, we have to run our commands on the df_ tables saved above.

filter(df_artists, name == 'Adele')

data.table

To use the data.table syntax, we have to run our commands on the dt_ tables savbed above.

dt_artists[name == 'Adele']

2. Sorting/ordering data

Sometimes, it’s super useful to order your data (or the results of some query). For example, let’s sort the total plays in Argentinaß before May 2015.

SQL

run_query('SELECT * FROM plays WHERE date < "2015-05-01" AND country = "ar" ORDER BY streams ASC') # from lowest to highest streams
#run_query('SELECT * FROM plays WHERE country = "NL" ORDER BY streams DESC') # from highest to lowest streams

dplyr

df_plays %>% filter(country=='ar' & date < as.Date('2015-05-01')) %>% arrange(streams) # ascending order
df_plays %>% filter(country=='ar' & date < as.Date('2015-05-01')) %>% arrange(desc(streams)) # descending order

data.table

setorderv(dt_plays, 'streams') # ascending
dt_plays[country=='ar' & date <= as.Date('2015-05-01')]
setorderv(dt_plays, 'streams', order = -1L) # descending
dt_plays[country=='ar' & date <= as.Date('2015-05-01')]

3. Summarizing/aggregating data

Probably one of the most important operations on data - even before “reshaping” it - is to aggregate/summarize data. Typically, aggregations are used to go from “one unit of analysis to another”.

Think about our plays table, for example. The primary key of that table is date-country-position (each row can be uniquely identified by the combination of a date, country, and position).

Basic aggregation with sum

Suppose now we’d like to “summarize” the data by day - e.g., how many plays in total were there on a given day?

SQL

run_query('SELECT date, SUM(streams) as total_plays FROM plays GROUP BY date LIMIT 10;')

dplyr

df_plays %>% group_by(date) %>% summarise(total_plays = sum(streams))

Note that the result looks a little different. That’s because SQL is using ALL data from our database, whereas the df_plays data only holds a sample of the data.

data.table

dt_plays[, list(total_plays = sum(streams)), by = c('date')]

More advanced functions to summarize data

Of course, there are multiple operations available to summarize date: for example, simply counting rows, counting the number of unique values, or taking the minimum or maximum within a particular group.

Here are a few examples:

SQL

run_query('SELECT COUNT(*) FROM tracks') # how many tracks are listed in the database?
run_query('SELECT COUNT(DISTINCT name) from artists') # how many artists are listed in the database?
run_query('SELECT MIN(streams) FROM plays WHERE country = "ar"') # what were the minimum streams for a song to enter the Argentenian charts?
run_query('SELECT AVG(streams) FROM plays WHERE country = "ar"') # what is the average number of streams for a song in the Argentenian Spotify Top 200?

dplyr

summarise(df_tracks, total_tracks = n())
summarise(df_artists, total_artists = n_distinct(name))
df_plays %>% filter(country == 'ar') %>% summarise(min_streams= min(streams))
df_plays %>% filter(country == 'ar') %>% summarise(avg_streams= mean(streams))

4. Making new variables / deleting existing variables

Many times, we’re in need of creating “new” variables from old variables.

Creating new variables

Dummy variables

For example, this way, we can create a dummy variable, indicating whether a given chart position was in the Top 10 (=1), or not (=0).

SQL

run_query('SELECT country, date, position, streams, (position <= 10) AS top10 FROM plays WHERE country = "ar"')

dplyr

df_plays %>% mutate(top10 = position<=10)

data.table

dt_plays[, top10 := ifelse(position<=10, TRUE, FALSE)]

Continuous variables

Creating continuous variables from scratch in SQL is quite complex; also, you loose a lot of flexibility the R gives you in using extra mathematical functions (after all, R is a statistical language, while SQL is “only” a database query language). Therefore, we give some examples in dplyr and data.table: here, taking the natural logarithm of a variable.

dplyr

df_plays2 = df_plays %>% mutate(ln_streams = log(streams))
df_plays2

data.table

dt_plays2 = dt_plays[, ln_streams := log(streams)]
dt_plays2

Deleting variables

Deleting variables in SQL doesn’t make sense (unless you’re a database manager and would like to seriously change the structure of the database). Below, we show you how you can delete variables in dplyr and data.table.

dplyr

df_plays2 %>% select(-ln_streams)

data.table

dt_plays2[, ln_streams := NULL]
dt_plays2

5. Reshaping data

In wrangling with data, you frequently need to change the format of your data to “fit” the input requirements of some analysis methods.

Prepare some example data

For this example, let us retrieve some data the number of plays for Adele and Drake across countries. Let’s not worry too much about the query to retrieve that data (it’s quite complicated). Instead, we focus on the reshaping operations later.

bycountry = run_query("SELECT artists.name as name, country, SUM(streams) as total_plays FROM plays LEFT JOIN tracks ON tracks.id = plays.track_id LEFT JOIN artists ON artists.id = tracks.artist_id WHERE artists.name IN ('Adele', 'Drake') GROUP BY artists.name, plays.country")

# let us also prepare this data for dplyr and data.table
df_bycountry = tbl_df(bycountry)
dt_bycountry = data.table(bycountry)

bycountry # the head() and tail() commands would only show you the top (bottom) six rows

Explore the data a bit (e.g., on the first page, on the last page). What do you see?

From long to wide

You probably agree it’s quite length to scroll through the table, as the table is very long. That’s why the format of this table is called “long” format: we use many rows to store the result of Adele and Drake, for each country. The primary key of this long-shaped data set is name and country, as every row can be uniquely pointed to by the combination of these column names.

Now let’s suppose we wanted to prepare an overview for a music manager on the performance of these two artists across all countries. Further, let’s suppose the manager is interested in learning in which country any of these two artists wasn’t making the Top 200 at all.

You probably agree that the long-shaped format isn’t very well suited for this - or would you want the manager to scroll through dozens of result pages? Right. Therefore, to prepare a good overview, we convert the data from its “long” format to the so-called “wide” format (which basically means: many columns).

Watch this! (Note we’re going to convert the result from above to tibbles and data.tables first)

dyplr

result1 = spread(df_bycountry, country, total_plays)
result1

data.table

result2 = dcast(dt_bycountry, name~country, value.var='total_plays')
result2

From wide to long

Sometimes, we also just wish to go back from wide to long. Let’s use the results from above as input to convert those back to the long format.

dplyr

result1b = gather(result1, 'country', 'total_plays', -name)
result1b

data.table

result2b = melt(result2, c('name'), variable.name='country', value.name = 'total_plays')
result2b

Uniting columns into one

We can use code to unite columns into one. For example, let’s create an overview about the Top 3 countries in terms of plays for each of the artists.

Let’s first create an indicator variable for the Top 3, filter on it.

# we use data.table for the conversion
setorderv(result2b, c('name', 'total_plays'), order = c(1, -1), na.last=T) 
# sort in ascending order by name, and descending order by 

result2b[, rank := 1:.N, by = c('name')] # create rank positions
# filter for ranks <= 3
result2b = result2b[rank <= 3]

result2b

Now we can combine the resulting country names with their corresponding streams in one column.

dplyr

result3a <- unite(result2b, 'country_streams', country, total_plays, sep = ' ')
result3a

data.table

# let us directly create the concatenated variable in the data.table
result3b = result2b
result3b[, country_streams := paste(country, total_plays)]
result3b

Separating column values into multiple columns

Now let’s suppose we only see the combined (comma-separated) values in the country_streams column in the example above. How would we separate those values back into individual columns?

Let us first “delete” the individual country/streams columns from the data.

tmp=result3b
tmp=tmp[, ':=' (country=NULL, total_plays=NULL)] # let's first remove the desired result
tmp

Watch how here:

dplyr

result4a = separate(tmp, country_streams, c('country','streams'), sep =' ')
result4a

data.table

# now let's extract the country names from the combined column
tmp[, country:=sapply(strsplit(country_streams, ' '), function(x) x[1])]

# now let's extract the streams from the combined column, converting them back to numerics
tmp[, total_plays:=as.numeric(sapply(strsplit(country_streams, ' '), function(x) x[2]))]

result4b = tmp

result4b

6. Combine data sets

Combining data sets is an essential component of preparing a data for analysis.

Row binds

The “most” simple combination is a row bind. Suppose you have data set A with 100 rows, and data set B with 150 rows (and both of these data sets have exactly the same column names), then you can bind these data sets together in the same way you would glue two books together. The resulting book has 250 pages.

Let’s first prepare two separate data sets (we just take two different rows from the artist dataset.

datasetA = df_artists %>% filter(name=='Adele')
datasetB = df_artists %>% filter(name=='AC/DC')

dplry

bind_rows(datasetA, datasetB)

data.table

rbindlist(list(datasetA, datasetB))

Joins/merges

In simple terms, merging (or joining) dataset A with dataset B means that you add columns from one dataset to the other dataset, given some corresponding key columns in both data sets.

Suppose you have a dataset A with artist names, and a dataset T with track names, and both of these data sets have an ID column for artists, then you can “add” the artist names to the track data set using a join. Specifically, a LEFT JOIN.

Important

To be able to merge datasets, they need to be at the same (or higher) level of aggregation. For example, if you have data set A which measures a playlist followers in weeks, and another data set B that measures the content of the playlists every day, then you can’t merge these two datasets directly.

However, you could do the following:

  • Aggregate/summarize data set B at the weekly level (e.g., average variables); the resulting dataset B now has the same level of aggregation than dataset B, and you can merge.
  • If your primary focus is to analyze your data at the daily level, then you could “blow up” data set A by inserting 7 rows (for each day of the week) for each week in that data, and linearly interpolate the number of followers. Now, both data sets have the same aggregation level (daily), and you can go ahead and merge them.

Left joins

Left joins add the columns of dataset B to dataset A, and leaves the structure of dataset A intact.

SQL

run_query('SELECT tracks.name as track_name, artists.name as artist_name FROM tracks LEFT JOIN artists ON tracks.artist_id = artists.id')

dplyr

left_join(tracks, artists, by = c('artist_id' = 'id'))

data.table

merge(tracks, artists, by.x='artist_id', by.y='id', all.x=T)

Right joins

Right joins work the other way around: it keeps all data in the right table (dataset B), and joins available data from the left table (dataset A)

SQL

run_query('SELECT tracks.name as track_name, artists.name as artist_name FROM artists RIGHT JOIN tracks ON tracks.artist_id = artists.id')

dplyr

right_join(artists, tracks, by = c('id' = 'artist_id'))

data.table

merge(artists, tracks, by.x='id', by.y='artist_id', all.y=T)

Inner joins

Inner joins keeps only rows in both data sets.

SQL

run_query('SELECT tracks.name as track_name, artists.name as artist_name FROM artists INNER JOIN tracks ON tracks.artist_id = artists.id')

dplyr

inner_join(tracks, artists, by = c('artist_id' = 'id'))

data.table

merge(tracks, artists, by.x='artist_id', by.y = 'id')

Special cases

Here, we discuss a few “special” joins.

Filling in data

Let’s take a look at the plays for the track “El Perdon”. Notice the track has mostly been playing in Argentina and Australia, sometime starting in 2015.

run_query("SELECT * from plays WHERE track_id = '4Ld3RS6KBXGsvtfmnjxlVl'")

Suppose now you’d like to prepare a panel data set in which for each day in 2015 and 2016, you’d see the number of streams for that particular song in Argentina and Australia. Merely using the SQL query above isn’t enough - two years have 365*2 days, and the resulting rows above only list data on two particular dates in a year.

What we then require first is an “empty” dataset, only listing the country, and date values for each day in 2015 and 2016.

Let’s create that data first:

countries = c('ar', 'au')
dates = seq(from=as.Date('2015-01-01'), to = as.Date('2016-12-31'), by = '1 day')

tmp = lapply(countries, function(x) data.table(country=x, date=dates))
keydata = rbindlist(tmp)
keydata

Browse this “empty” data set above. Can you confirm it lists all dates in 2015 and 2016, for both countries?

We can now go ahead and merge our song data, using a LEFT JOIN (we want to keep all rows in A, and only merge the rows to it when the song was actually in the Top 200).

dplyr

# Version 1
tmp =  df_plays %>% filter(track_id == '4Ld3RS6KBXGsvtfmnjxlVl') %>% mutate(date=as.Date(date))
left_join(keydata, tmp) %>% select(country, date, streams) %>% replace_na(list(streams=0))
## Joining, by = c("country", "date")
# Version 2 is even more elegant/readable
tmp %>% complete(date= seq.Date(from = as.Date('2015-01-01'), to = as.Date('2016-12-31'), by = '1 day')) %>% select(country, date, streams)

Scroll down to about April/May 2015: see those streams there?

data.table

merged_data = merge(keydata, tmp, by = c('country', 'date'), all.x=T)
merged_data[is.na(streams), streams := 0] # replace NAs with 0s
merged_data[, c('country', 'date','streams'),with=F] # only show selected columns

7. Loading and saving

Alright - we’re done “wrangling” with our data. It’s time to save it.

Saving data

CSV files

Saving your data in CSV files has several advantages: - program independence: you can open the data in any software (e.g., R, Python, SPSS, …) - the data is likely to be accessible in the future: CSV files have a very simple structure that you can even see while merely looking at it in a text editor.

The disadvantages are: - If you’re working with special encodings (e.g., special characters), CSV files many times break when you try to open them in other software. For example, if you open a CSV file in Excel and save it again, it’s format will be changed and R may encounter difficulties reading it in.

Here, we by all means prefer data.table due to its unprecedented speed in writing data to the disk.

data.table

# Let's compare a regular write (100 times)...
system.time({for (i in 1:100) write.table(plays, 'plays.csv')})
##    user  system elapsed 
##   0.484   0.031   0.228
# To 100 times a data.table fwrite
system.time({for (i in 1:100) fwrite(plays, 'plays.csv')})
##    user  system elapsed 
##   0.026   0.012   0.038
# It's save to assume fwrite always beats R's own write.table. 
# So, let's write it to the disk.
fwrite(plays, 'plays.csv')

new_data <- fread('plays.csv')

RData

Sometimes it’s also convenient to store data directly in the RData format. You can also save more than one object in a file. The advantage is you can read in all data at the same time, and the data is also automatically zipped. The disadvantage is that you need R to read in that data (so no independence!), and it may take a while to load in the data (way longer than reading in three CSVs!).

save(plays, artists, tracks, file = 'my_data.RData')

Reading data

CSV files

Let’s now read in our saved data.

my_plays <- fread('plays.csv')
head(my_plays)
# remove our existing data first
rm(plays, artists, tracks)

load('my_data.RData')

head(plays)
head(artists)
head(tracks)